Cloud Computing I - Week02 - 2 - Membership

What is a Group Membership List?

Mean Time to Failure (MTTF)

Target Settings

Process 'group' based systems
- clouds/datacenters
- replicated servers
- Distributed DBs
crash-stop/fail-stop process failures

Group Membership Service

The membership list is the list of all the processes that are currently running.
Application components, e.g. gossip, overlays, DHTs etc., query this list and stay in sync with it.
The membership protocol maintains the membership list.
One of the challenges is that the membership protocol has to communicate over an unreliable medium, which can drop or delay packets.

Strongly consistent membership: every process holds the complete list at all times. Virtual synchrony, a well-known distributed computing paradigm, relies on this.
Partially consistent list
Weakly consistent list
A membership protocol has two components: failure detection + dissemination

Failure Detectors

Distributed Failure Detectors: Properties

Completeness
- every failure should be detected eventually (that is, there is no time bound)
Accuracy
- there should be no false positives (no live process is marked as failed)
Speed
Scale
Completeness and accuracy cannot both be guaranteed together in lossy networks.

Failure Detector Properties


Completeness	Guaranteed (almost always 100%)
Accuracy	Partial/probabilistic guarantee (<100%)
Speed	Time to first detection of a failure
Scale	Equal load on each member; low network message load; no bottlenecks or single point of failure

Centralised Heartbeating

every process p_i sends periodic heartbeat signals to a central process p_j
each heartbeat carries a sequence number that p_i increments
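
As a rough illustration of the central process, here is a minimal sketch. The class name CentralMonitor, the timeout value and the explicit `now` argument are assumptions made for the example, not part of the lecture.

```python
class CentralMonitor:
    """Hypothetical central process p_j that tracks heartbeats from every p_i."""

    def __init__(self, timeout):
        self.timeout = timeout   # assumed: seconds of silence before suspecting failure
        self.last_seq = {}       # member id -> highest sequence number seen
        self.last_seen = {}      # member id -> local time of the last fresh heartbeat

    def on_heartbeat(self, member, seq, now):
        # Accept only heartbeats newer than the one already recorded.
        if seq > self.last_seq.get(member, -1):
            self.last_seq[member] = seq
            self.last_seen[member] = now

    def failed_members(self, now):
        # A member is suspected once it has been silent for more than `timeout`.
        return sorted(m for m, t in self.last_seen.items() if now - t > self.timeout)


monitor = CentralMonitor(timeout=3.0)
monitor.on_heartbeat("p1", seq=1, now=0.0)
monitor.on_heartbeat("p2", seq=1, now=0.0)
monitor.on_heartbeat("p1", seq=2, now=2.0)   # p2 stays silent after t=0
print(monitor.failed_members(now=4.0))       # ['p2']
```

The sketch also makes the weakness visible: all state lives in one process, so p_j is both a hotspot and a single point of failure.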

Ring Heartbeating

each process sends heartbeats to both its left and its right neighbor
the heartbeat is the same as before: a sequence number
Failure Condition
- if multiple processes fail at once, some failures may go undetected
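
The failure condition can be checked with a small sketch. It assumes process i is monitored only by its two ring neighbors, and the function name undetected_failures is made up for the example.

```python
def undetected_failures(n, failed):
    """Return the failed processes in a ring of n that no live neighbor can detect.

    Assumes process i is monitored only by (i - 1) % n and (i + 1) % n.
    """
    failed = set(failed)
    return sorted(
        p for p in failed
        if (p - 1) % n in failed and (p + 1) % n in failed
    )


print(undetected_failures(8, [3]))        # []  : a live neighbor notices
print(undetected_failures(8, [2, 3, 4]))  # [3] : both neighbors of 3 are down too
```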

All-to-All Heartbeat

each process sends heartbeats to all the other processes
equal load per member
it is complete
problem:
- suppose there is one node p_j that is slow to receive or process heartbeats
- it may wrongly mark all the other nodes as failed

Gossip-Style Membership

a variant of all-to-all heartbeating that is more robust
it has good accuracy properties

Gossip Style Failure Detection

if a member's heartbeat counter has not increased for more than T_fail seconds, the member is considered failed
after a further T_cleanup seconds, the member is deleted from the list
Why do we have two timeouts?
- because one node may have deleted its entry while another has not yet done so
- gossip from the second node would then add the deleted entry back as if it were new
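
The two timeouts can be sketched as follows. This is a simplified illustration, not the lecture's implementation: the class GossipMember, its method names and the 4-second values are assumptions, and time is passed in explicitly so the example is deterministic.

```python
T_FAIL = 4.0      # assumed value, seconds
T_CLEANUP = 4.0   # assumed value, seconds


class GossipMember:
    def __init__(self):
        self.table = {}      # member id -> [heartbeat counter, local time it last increased]
        self.failed_at = {}  # member id -> local time it was marked failed

    def merge(self, remote, now):
        """Merge a gossiped table {member id: heartbeat counter} into the local one."""
        for member, hb in remote.items():
            entry = self.table.get(member)
            if entry is None or hb > entry[0]:
                self.table[member] = [hb, now]
                self.failed_at.pop(member, None)  # fresher heartbeat: alive after all

    def sweep(self, now):
        """Apply the T_fail rule, then the T_cleanup rule."""
        for member, (hb, seen) in self.table.items():
            if member not in self.failed_at and now - seen > T_FAIL:
                self.failed_at[member] = now
        for member, t in list(self.failed_at.items()):
            if now - t > T_CLEANUP:
                del self.failed_at[member]
                self.table.pop(member, None)


m = GossipMember()
m.merge({"p2": 7}, now=0.0)
m.sweep(now=5.0)               # silent for more than T_FAIL: marked failed, entry kept
m.merge({"p2": 7}, now=6.0)    # stale gossip from a slower peer changes nothing
print("p2" in m.failed_at, m.table["p2"])   # True [7, 0.0]
m.sweep(now=10.0)              # more than T_CLEANUP after marking: entry deleted
print("p2" in m.table)                      # False
```

Because the entry is kept during the cleanup window, the stale heartbeat counter 7 is recognized as old and is not re-added. Deleting the entry immediately at T_fail would make that same gossip look like a brand-new member.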

Analysis/Discussion

What happens if gossip period T_gossip is decreased?
A single heartbeat takes O(log(N)) time to propagate
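
The O(log(N)) claim can be explored with a small simulation. It assumes a simple push model in which every member that already has the update gossips it to one uniformly random member per round; the function name and the fixed seed are choices made for the example. Wall-clock propagation time is the number of rounds multiplied by T_gossip.

```python
import math
import random


def rounds_to_spread(n, seed=0):
    """Count push-gossip rounds until one heartbeat update reaches all n members."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        # Every informed member gossips to one uniformly random member per round.
        targets = {rng.randrange(n) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds


for n in (64, 1024):
    # Print N, the simulated round count, and ceil(log2(N)) for comparison.
    print(n, rounds_to_spread(n), math.ceil(math.log2(n)))
```

The informed set can at most double per round, so the round count is never below ceil(log2(N)); in this model it typically stays within a small constant factor of it.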

Which is the best failure detector?


Completeness	Guaranteed always
Accuracy	Probability of a mistake (false detection) within time T, PM(T)
Speed	T time units
Scale	Equal load L on each member; network message load N*L (compare this across protocols)

All to all heartbeating

in case of NORMAL ALL-to-ALL HEARTBEATING

in case of Gossip-Based ALL-to-ALL HEARTBEATING

Gossip has a higher load than plain all-to-all heartbeating
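
The comparison can be made concrete with a back-of-the-envelope sketch. The load formulas are assumptions filled in for illustration, since the notes do not record them: plain all-to-all sends roughly N heartbeats per member every T time units, while gossip sends a message with N entries every gossip period and needs about log(N) periods to reach detection time T.

```python
import math


def all_to_all_load(n, t):
    # Assumed: each member sends about n heartbeats every t time units.
    return n / t


def gossip_load(n, t):
    # Assumed: detection time t is about log(n) gossip periods, and each
    # gossip message carries n entries, so the gossip period is t / log(n).
    t_gossip = t / math.log(n)
    return n / t_gossip


n, t = 1000, 10.0
print(all_to_all_load(n, t))                       # 100.0
print(gossip_load(n, t) / all_to_all_load(n, t))   # ratio is log(n)
```

Under these assumptions the gossip load exceeds the plain all-to-all load by a factor of log(N) for the same detection time.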

The best/optimal we can do!

The optimal worst-case load L* (per member) is expressed as a function of T, PM(T) and N,
assuming an independent message loss probability p_ml.

L* turns out not to depend on N.
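
The notes do not record the formula itself. The bound usually quoted for this setting is L* = log(PM(T)) / log(p_ml) * (1/T); the sketch below takes that form as an assumption and evaluates it for made-up parameter values.

```python
import math


def optimal_load(pm_t, p_ml, t):
    """Assumed bound L* = log(PM(T)) / log(p_ml) / T; note that N does not appear."""
    return math.log(pm_t) / math.log(p_ml) / t


# Hypothetical numbers: one mistake in a million, 10% message loss, T = 10.
print(optimal_load(pm_t=1e-6, p_ml=0.1, t=10.0))   # about 0.6 messages per time unit
```

Since the function takes no N argument, the per-member load under this bound stays constant as the group grows, which is the sense in which it is independent of N.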

The problem is that gossip-based heartbeating tries to do both failure detection and dissemination together.
So, the KEY is
- Separate the two components
- Use a non-heartbeat-based failure detection component

Another Probabilistic Failure Detector

Dissemination and suspicion
